Formal Properties of XML Grammars and Languages

Jean Berstel 

Institut Gaspard Monge (IGM) 
Université de Marne-la-Vallée
5, boulevard Descartes, 77454 Marne-la-Vallée Cedex 2

Luc Boasson 

Laboratoire d'informatique algorithmique: fondements et applications (LIAFA)
Université Denis-Diderot (Paris VII)
2, place Jussieu, 75251 Paris Cedex 05 

February 1, 2008 

Abstract 

XML documents are described by a document type definition (DTD). 
An XML-grammar is a formal grammar that captures the syntactic 
features of a DTD. We investigate properties of this family of gram- 
mars. We show that every XML-language basically has a unique 
XML-grammar. We give two characterizations of languages gener- 
ated by XML-grammars, one is set-theoretic, the other is by a kind of 
saturation property. We investigate decidability problems and prove 
that some properties that are undecidable for general context-free lan- 
guages become decidable for XML-languages. We also characterize 
those XML-grammars that generate regular XML-languages. 

Résumé

XML documents are described by a document type definition (DTD). An
XML-grammar is a formal grammar that retains the syntactic aspects of
a DTD. We study the properties of this family of grammars. We show
that an XML-language has essentially a single XML-grammar. We give two
characterizations of the languages generated by XML-grammars: the
first is set-theoretic, the second is by a saturation property. We
examine decision problems and prove that certain properties that are
undecidable for general context-free languages become decidable for
XML-languages. We also characterize the XML-grammars that generate
regular languages.

1 Introduction 

XML (eXtensible Markup Language) is a format recommended by the W3C in
order to structure a document. The syntactic part of the language describes
the relative position of pairs of corresponding tags. This description is by 
means of a document type definition (DTD). In addition to its syntactic part, 
each tag may also have attributes. If the attributes in the tags are ignored, a 
DTD appears to be a special kind of context-free grammar. The aim of this 
paper is to study this family of grammars. 

One of the consequences will be a better appraisal of the structure of 
XML documents. It will also illustrate the kind of limitations that exist in 
the power of expression of XML. Consider for instance an XML-document 
that consists of a sequence of paragraphs. A first group of paragraphs is 
being typeset in bold, a second one in italic. It is not possible to specify, by
a DTD, that in a valid document there are as many paragraphs in bold as
in italic. This is due to the fact that the context-free grammars corresponding
to DTDs are rather restricted.

As another example, assume that, in developing a DTD for mathematical 
documents, we require that in a (full) mathematical paper, there are as 
many proofs as there are statements, and moreover that proofs appear always 
after statements (in other words, the sequence of occurrences of statements 
and proofs is well-balanced). Again, there is no DTD describing this
kind of requirement. Pursuing in this direction, there is of course a strong
analogy between pairs of tags in an XML document and the \begin{object} and
\end{object} construction for environments in LaTeX. The LaTeX compiler
merely checks that the constructs are well-formed, but there is no other
structuring method.

The main results in this paper are two characterizations of XML-languages.
The first (Theorem 4.2) is set-theoretic. It shows that XML-languages
are the biggest languages in some class of languages. It relies on the fact that,
for each XML-language, there is only one XML-grammar that generates it.
The second characterization (Theorem 4.4) is syntactic. It shows that
XML-languages have a kind of "saturation property".

As usual, these results can be used to show that some languages cannot
be XML. This means in practice that, in order to achieve some features of
pages, additional nonsyntactic techniques have to be used.

The paper is organized as follows. The next section contains the definition
of XML-grammars and their relation to DTDs. Section 3 contains some
elementary results, and in particular the proof that there is a unique
XML-grammar for each XML-language. It appears that a new concept plays an
important role in XML-languages: the notion of surface. The surface of an
opening tag $a$ is the set of sequences of opening tags that are children of
$a$ (i.e. the tags immediately under $a$ that may follow $a$ in a document
before the closing tag $\bar a$ is reached). The surfaces of an XML-language
must be regular sets, and in fact describe the XML-grammar. The
characterization results are given in Section 4. They heavily rely on
surfaces, but the second also uses the syntactic concept of a context.

Section 5 investigates decision problems. It is shown that it is decidable
whether the language generated by a context-free grammar is well-formed,
but it is undecidable whether there is an XML-grammar for it. In
contrast, it is decidable whether the surfaces of a context-free grammar are
finite (Section 6).

Section 7 is concerned with regular XML-languages. It appears indeed
that most XML-languages used in practical applications are regular. We
show that, for a given regular language, it is decidable whether it is an XML- 
language, and we give a structural description of regular XML-grammars. 

The final section is a historical note. Indeed, several species of context- 
free grammars investigated in the sixties, such as parenthesis grammars or 
bracketed grammars are strongly related to XML-grammars. These relation- 
ships are sketched. 

A preliminary version of this paper appeared in the proceedings of the
MFCS 2000 conference [1].

2 Notation 

An XML document [8] is composed of text and of tags. The tags are opening
or closing. Each opening tag has a unique associated closing tag, and
conversely. There are also tags, called empty tags, which are both opening
and closing. These tags may always be replaced by an opening tag immediately
followed by its closing tag. We do so here, and therefore assume that
there are no empty tags.

Let $A$ be a set of opening tags, and let $\bar A$ be the set of corresponding
closing tags. Since we are interested in syntactic structure, we ignore any
text. Thus, an XML document (again with any attribute ignored) is a word
over the alphabet $T = A \cup \bar A$.

A document $x$ is well-formed if the word $x$ is a correctly parenthesized
word, that is if $x$ is in the set of Dyck primes over $A \cup \bar A$. Observe
that the word is a prime, so it is not a product of two well parenthesized
words. Also, it is not the empty word.
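
As an illustration of this definition, here is a minimal sketch of a test for Dyck primes. The encoding is an assumption made for the example only: a word is a Python string, an opening tag is a lowercase letter, and the corresponding closing tag is the same letter in uppercase. The function name `is_dyck_prime` is hypothetical.

```python
def is_dyck_prime(word: str) -> bool:
    """Return True if `word` is a Dyck prime under the assumed encoding
    (lowercase letter = opening tag, uppercase letter = its closing tag)."""
    stack = []
    for i, t in enumerate(word):
        if t.islower():
            stack.append(t)
        elif t.isupper():
            # A closing tag must match the most recently opened tag.
            if not stack or stack[-1] != t.lower():
                return False
            stack.pop()
            # If everything is closed before the last letter, the word
            # is a product of several primes, hence not a prime.
            if not stack and i != len(word) - 1:
                return False
        else:
            return False
    # The empty word is excluded, and every tag must be closed.
    return bool(word) and not stack
```

For instance, the word encoded as `"abBA"` is a Dyck prime, while `"aAbB"` is well parenthesized but is a product of two primes.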

An XML-grammar is composed of a terminal alphabet $T = A \cup \bar A$, of a
set of variables $V$ in one-to-one correspondence with $A$, of a distinguished
variable called the axiom and, for each letter $a \in A$, of a regular set
$R_a \subseteq V^*$ defining the (possibly infinite) set of productions

    $X_a \to a m \bar a$,   $m \in R_a$, $a \in A$

We also write for short

    $X_a \to a R_a \bar a$,

as is done in DTDs. An XML-language is a language generated by some
XML-grammar.

It is well-known from formal language theory that non-terminals in a 
context-free grammar may have infinite regular (or even context-free) sets of 
productions, and that the generated language is still context-free. Thus, any 
XML-language is context-free. Moreover, it is a deterministic context-free 
language in the sense that there is a deterministic push-down automaton 
([4]) recognizing it. 

Example 2.1 The language $\{a^n \bar a^n \mid n > 0\}$ is an XML-language,
generated by

    $X \to a (X \mid \varepsilon) \bar a$

Example 2.2 The language of Dyck primes over $\{a, \bar a\}$ is an
XML-language, generated by

    $X \to a X^* \bar a$

Example 2.3 The language $D_A$ of Dyck primes over $T = A \cup \bar A$ is
generated by the grammar

    $X \to \sum_{a \in A} X_a$
    $X_a \to a X^* \bar a$,   $a \in A$

It is not an XML-language. However, each $X_a$ in this grammar generates an
XML-language, which is $D_A \cap a T^* \bar a$.

In the sequel, all grammars are assumed to be reduced, that is, every
non-terminal is accessible from the axiom, and every non-terminal produces
at least one terminal word. Note that for a regular (or even a recursive) set
of productions, the reduction procedure is effective.

Given a grammar $G$ over a terminal alphabet $T$ and a nonterminal $X$ we
denote by

    $L_G(X) = \{ w \in T^* \mid X \xrightarrow{*} w \}$

the language generated by $X$ in the grammar $G$.

Remark 2.4 The definition has the following correspondence to the termi- 
nology and notation used in the XML community ([8]). The grammar of 
a language is called a document type definition (DTD). The axiom of the 
grammar is qualified DOCTYPE, and the set of productions associated to a tag 
is an ELEMENT. The syntax of an element implies by construction the one-to- 
one correspondence between pairs of tags and non-terminals of the grammar. 
Indeed, an element is composed of a type and of a content model. The type 
is merely the tag name and the content model is a regular expression for 
the set of right-hand sides of the productions for this tag. For instance, the 
grammar 

    $S \to a (S \mid T)(S \mid T) \bar a$
    $T \to b T^* \bar b$

with axiom $S$ corresponds to

    <!DOCTYPE a [
    <!ELEMENT a ((a|b),(a|b)) >
    <!ELEMENT b (b)* >
    ]>

Here, $S$ and $T$ stand for the nonterminals $X_a$ and $X_b$, respectively.
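
The correspondence between content models and productions can be made concrete with a small validity checker. This is a sketch under stated assumptions, not a DTD validator: tags are single lowercase letters, a closing tag is the uppercase letter, and each content model is written as a Python regular expression over tag letters. The names `parse`, `is_valid` and `MODELS` are hypothetical.

```python
import re

def parse(word):
    """Parse a Dyck prime into a tree (tag, children).
    Raises ValueError if the word is not a Dyck prime."""
    root, stack = None, []
    for t in word:
        if t.islower():
            if root is not None and not stack:
                raise ValueError("more than one top-level element")
            node = (t, [])
            if stack:
                stack[-1][1].append(node)
            else:
                root = node
            stack.append(node)
        else:
            if not stack or stack[-1][0] != t.lower():
                raise ValueError("mismatched closing tag")
            stack.pop()
    if root is None or stack:
        raise ValueError("not a Dyck prime")
    return root

def is_valid(word, content_models, axiom):
    """Check that every element's sequence of children matches the
    regular expression given for its tag."""
    try:
        root = parse(word)
    except ValueError:
        return False
    if root[0] != axiom:
        return False
    todo = [root]
    while todo:
        tag, children = todo.pop()
        trace = "".join(child[0] for child in children)
        model = content_models.get(tag)
        if model is None or re.fullmatch(model, trace) is None:
            return False
        todo.extend(children)
    return True

# The grammar of this remark: a -> ((a|b),(a|b)), b -> (b)*
MODELS = {"a": r"(a|b)(a|b)", "b": r"b*"}
```

With these models, `is_valid("abBbBA", MODELS, "a")` holds, whereas `"abBA"` is rejected because the element `a` has a single child.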

The regular expressions allowed for the content model are of two types:
those called children, and those called mixed [8]. In fact, since we do not
consider text, mixed expressions are no longer a special case.

In the definition of XML-grammars, we ignore entities, both general and 
parameter entities. Indeed, these may be considered as shorthand and are 
handled at a lexical level. 

Remark 2.5 In the recent specification of XML Schemas ([9]), a DTD is
called a schema. The syntax used for defining schemas is XML itself. Among
the most significant enrichments of schemas is the use of types. Also the purely
syntactical part of XML schemas is more evolved than that of DTDs.

3 Elementary Results 

We denote by $D_a$ the language of Dyck primes starting with the letter $a$.
This is the language generated by $X_a$ in Example 2.3. We set
$D_A = \bigcup_{a \in A} D_a$. This is not an XML-language if $A$ has more
than one letter. We call $D_A$ the set of Dyck primes over $A$ and we omit the
index $A$ if possible. The set $D$ is known to be a bifix code, that is no
word in $D$ is a proper prefix or a proper suffix of another word in $D$.

Let L be any subset of the set D of Dyck primes over A. The aim of 
this section is to give a necessary and sufficient condition for L to be an 
XML-language. 

We denote by $F(L)$ the set of factors of $L$, and we set
$F_a(L) = D_a \cap F(L)$ for each letter $a \in A$. Thus $F_a(L)$ is the set
of those factors of words in $L$ that are also Dyck primes starting with the
letter $a$. These words are called well-formed factors.

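Example 3.1 For the language

    $L = \{ a b^{2n} \bar b^{2n} \bar a \mid n \ge 1 \}$

one has $F_a(L) = L$ and $F_b(L) = \{ b^n \bar b^n \mid n \ge 1 \}$.

Example 3.2 Consider the language

    $L = \{ a (b \bar b)^n (c \bar c)^n \bar a \mid n \ge 1 \}$

Then $F_a(L) = L$, $F_b(L) = \{ b \bar b \}$, $F_c(L) = \{ c \bar c \}$.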

The sets $F_a(L)$ are important for XML-languages and grammars, as
illustrated by the following lemma:

Lemma 3.3 Let $G$ be an XML-grammar over $A \cup \bar A$ generating a language
$L$, with nonterminals $X_a$, for $a \in A$. For each $a \in A$, the language
generated by $X_a$ is the set of factors of words in $L$ that are Dyck primes
starting with the letter $a$, that is

    $L_G(X_a) = F_a(L)$

Proof. Set $T = A \cup \bar A$. Consider first a word $w \in L_G(X_a)$.
Clearly, $w$ is in $D_a$. Moreover, since the grammar is reduced, there are
words $g, d$ in $T^*$ such that $X \xrightarrow{*} g X_a d$, where $X$ is the
axiom of $G$. Thus $w$ is a factor of $L$.

Conversely, consider a word $w \in F_a(L)$ for some letter $a$, and let $g, d$
be words such that $gwd \in L$. Due to the special form of an XML-grammar, any
letter $a$ can only be generated by a production with non-terminal $X_a$. Thus,
a left derivation $X \xrightarrow{*} gwd$ factorizes into

    $X \xrightarrow{k} g X_a \beta \xrightarrow{*} g w d$        (1)

for some word $\beta$, where $k$ is the number of letters in $g$ that are in
$A$. Next

    $g X_a \beta \xrightarrow{*} g w' \beta \xrightarrow{*} g w d$        (2)

with $X_a \xrightarrow{*} w'$ and $w' \in D$. None of $w$ and $w'$ can be a
proper prefix of the other, because $D$ is bifix. Thus $w' = w$. This shows
that $w$ is in $L_G(X_a)$ and proves that $F_a(L) = L_G(X_a)$. ■

Corollary 3.4 For any XML-language $L \subseteq D_a$, one has $F_a(L) = L$. ■

Let $w$ be a Dyck prime in $D_a$. It has a unique factorization

    $w = a u_{a_1} u_{a_2} \cdots u_{a_n} \bar a$

with $u_{a_i} \in D_{a_i}$ for $i = 1, \ldots, n$. The trace of the word $w$
is defined to be the word $a_1 a_2 \cdots a_n \in A^*$.

If $L$ is any subset of $D$, and $w \in L$, then the words $u_{a_i}$ are in
$F_{a_i}(L)$. The surface of $a \in A$ in $L$ is the set $S_a(L)$ of all
traces of words in $F_a(L)$.

Example 3.5 For the language of Example 3.1, the surfaces are easily seen
to be $S_a = \{b\}$ and $S_b = \{b, \varepsilon\}$.

Example 3.6 The surfaces of the language of Example 3.2 are
$S_a = \{ b^n c^n \mid n \ge 1 \}$ and $S_b = S_c = \{\varepsilon\}$.
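
To make traces and surfaces concrete, the following sketch computes, for a finite sample of Dyck primes, the set of traces of all well-formed factors, grouped by opening tag. The encoding (lowercase letter for an opening tag, uppercase for its closing tag) and the function name `surfaces` are assumptions of the example; the input words are assumed to be Dyck primes and are not re-checked.

```python
from collections import defaultdict

def surfaces(words):
    """Map each opening tag to the set of traces of the well-formed
    factors occurring in `words` (assumed to be Dyck primes)."""
    result = defaultdict(set)
    for word in words:
        stack = []  # one (tag, list of child tags) pair per open element
        for t in word:
            if t.islower():
                if stack:
                    stack[-1][1].append(t)  # t is a child of the open tag
                stack.append((t, []))
            else:
                tag, children = stack.pop()
                result[tag].add("".join(children))
    return dict(result)
```

Applied to the first few words of the language of Example 3.1, this yields the trace `"b"` for the tag `a` and the traces `"b"` and `""` (the empty word) for the tag `b`.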

It is easily seen that the surfaces of the set of Dyck primes over A are all 
equal to A*. 

Surfaces are useful for defining XML-grammars. Let $S = \{S_a \mid a \in A\}$
be a family of regular languages over $A$. We define an XML-grammar $G$
associated to $S$ called the standard grammar of $S$ as follows. The set of
variables is $V = \{X_a \mid a \in A\}$. For each letter $a$, we set

    $R_a = \{ X_{a_1} X_{a_2} \cdots X_{a_n} \mid a_1 a_2 \cdots a_n \in S_a \}$

and we define the productions to be

    $X_a \to a m \bar a$,   $m \in R_a$

for all $a \in A$. Since $S_a$ is regular, the sets $R_a$ are regular over the
alphabet $V$. By construction, the surface of the language generated by a
variable $X_a$ is $S_a$, that is $S_a(L_G(X_a)) = S_a$. For any choice of the
axiom, the grammar is an XML-grammar.

Example 3.7 The standard grammar for the surfaces of Example 3.1 is

    $X_a \to a X_b \bar a$
    $X_b \to b (X_b \mid \varepsilon) \bar b$

The language generated by $X_a$ is $\{ a b^n \bar b^n \bar a \mid n \ge 1 \}$
and is not the language of Example 3.1.
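
For finite surfaces, the words generated by the standard grammar can be enumerated up to a length bound. The sketch below does this; the encoding (lowercase opening tags, uppercase closing tags), the name `standard_words`, and the restriction to finite surfaces given as Python sets of strings are assumptions of the example.

```python
from functools import lru_cache

def standard_words(surf, axiom, max_len):
    """All words of length <= max_len generated by the variable of
    `axiom` in the standard grammar of the finite surfaces `surf`
    (a dict mapping each tag to a set of traces)."""
    @lru_cache(maxsize=None)
    def gen(tag, budget):
        # Words of length <= budget derived from the variable of `tag`.
        if budget < 2:
            return frozenset()
        out = set()
        for trace in surf.get(tag, ()):
            middles = {""}
            for child in trace:
                # Extend each partial middle part by a word for `child`,
                # keeping the total length within budget - 2.
                middles = {m + u for m in middles
                           for u in gen(child, budget - 2 - len(m))}
            out.update(tag + m + tag.upper() for m in middles)
        return frozenset(out)
    return set(gen(axiom, max_len))
```

With the surfaces of Example 3.1, the enumeration up to length 8 contains `"abBA"` and `"abbbBBBA"`, which have an odd number of `b`, so the generated language is strictly larger than the language of Example 3.1.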

This construction is in some sense the only way to build XML-grammars, 
as shown by the following proposition. 

Proposition 3.8 For each XML-language L, there exists exactly one reduced 
XML-grammar generating L, up to renaming of the variables. 

Proof. Let $G$ be an XML-grammar generating $L$, with nonterminals
$V = \{X_a \mid a \in A\}$, and
$R_a = \{ m \in V^* \mid X_a \to a m \bar a \}$ for each $a \in A$. We claim
that the mapping

    $X_{a_1} X_{a_2} \cdots X_{a_n} \mapsto a_1 a_2 \cdots a_n$        (*)

is a bijection from $R_a$ onto the surface $S_a(L)$ for each $a \in A$. Since
the surface depends only on the language, this suffices to prove the
proposition. It is clear that (*) is a bijection from $V^*$ onto $A^*$. It
remains to show that its restriction to $R_a$ is onto $S_a(L)$.

If

    $X_a \to a X_{a_1} X_{a_2} \cdots X_{a_n} \bar a$

is a production, then $a_1 a_2 \cdots a_n$ is the trace of some word $u$ in
$L_G(X_a)$. By Lemma 3.3, the word $u$ is in $F_a(L)$, and thus
$a_1 a_2 \cdots a_n$ is in $S_a(L)$.

Conversely, if $a_1 a_2 \cdots a_n$ is in $S_a(L)$, then there is a word
$w \in F_a(L) = L_G(X_a)$ such that

    $w = a u_1 u_2 \cdots u_n \bar a$

with $u_i \in D_{a_i}$. Thus, there is a derivation

    $X_a \to a m \bar a \xrightarrow{*} w$

in $G$. Setting $m = Y_1 Y_2 \cdots Y_k$ with $Y_1, \ldots, Y_k \in V$, there
are words $u'_1, \ldots, u'_k$ such that $Y_i \xrightarrow{*} u'_i$ and

    $u_1 \cdots u_n = u'_1 \cdots u'_k$

However, each $u_i$, $u'_i$ is a Dyck prime, and since the sets of Dyck primes
are codes, it follows that $n = k$ and $u_i = u'_i$ for $i = 1, \ldots, n$.
Since the words $u_i$ are in $F_{a_i}(L)$, there are derivations
$X_{a_i} \xrightarrow{*} u_i$. Thus $Y_i = X_{a_i}$ and
$m = X_{a_1} X_{a_2} \cdots X_{a_n}$ as required. ■

Remark 3.9 Obviously, Proposition 3.8 is no longer true if entities are
allowed. Indeed, entities may be used to group sets of productions in many
different ways.

Corollary 3.10 Let $L_1$ and $L_2$ be two XML-languages. Then
$L_1 \subseteq L_2$ iff $S_a(L_1) \subseteq S_a(L_2)$ for all $a$ in $A$.

Proof. The condition is clearly necessary, and by the previous construction, 
it is also sufficient. ■ 

Proposition 3.11 The inclusion and the equality of XML-languages are
decidable.

Proof. This follows directly from Corollary 3.10. ■ 

In particular, it is decidable if an XML-language $L$ is empty. Similarly,
it is decidable if $L = D_a$.

XML-languages are not closed under union and difference. This will follow
easily from the characterizations given in the next section (Example 4.10).

The following proposition is interesting from a practical point of view. 
Indeed, it shows that a stepwise refinement technique can be used in order 
to design a DTD that satisfies or at least approaches a given specification. 

Proposition 3.12 The intersection of two XML-languages is an XML-lan- 
guage. 

Proof. Let $L$ and $L'$ be XML-languages generated by XML-grammars $G$ and
$G'$. We define a new grammar $G \times G'$ with set of variables
$V \times V'$ and productions

    $(X, X') \to a (X_1, X'_1) \cdots (X_n, X'_n) \bar a$

if and only if $X \to a X_1 \cdots X_n \bar a$ in $G$ and
$X' \to a X'_1 \cdots X'_n \bar a$ in $G'$. The inclusion
$L_{G \times G'}(X, X') \subseteq L_G(X) \cap L_{G'}(X')$ is clear.
Conversely, assume $w \in L_G(X) \cap L_{G'}(X')$. Then
$X \to a X_1 \cdots X_n \bar a \xrightarrow{*} w$ in $G$ and
$X' \to a X'_1 \cdots X'_{n'} \bar a \xrightarrow{*} w$ in $G'$. Thus
$w = a u_1 \cdots u_n \bar a = a u'_1 \cdots u'_{n'} \bar a$, where
$X_i \xrightarrow{*} u_i$ and $X'_i \xrightarrow{*} u'_i$. Since the set of
Dyck primes is a code, one has $n = n'$ and $u_i = u'_i$. Thus
$u_i \in L_G(X_i) \cap L_{G'}(X'_i)$ and the result follows by induction. ■
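
When the content models are finite, the product construction of this proof amounts to intersecting the two content models of each tag, since a variable for the tag $a$ can only be paired with the variable for the same tag. The sketch below does this and then reduces the resulting grammar. The restriction to finite content models given as sets of traces, and the name `intersect`, are assumptions of the example; regular content models would require an automaton product instead.

```python
def intersect(r1, r2, axiom):
    """Content models of a reduced grammar for the intersection of two
    XML-languages given by finite content models r1 and r2
    (dicts mapping a tag to a set of traces), with a common axiom."""
    # Product construction restricted to pairs of variables of one tag.
    prod = {a: r1[a] & r2[a] for a in r1.keys() & r2.keys()}

    # Keep only tags that derive at least one terminal word.
    productive = set()
    changed = True
    while changed:
        changed = False
        for a, traces in prod.items():
            if a not in productive and any(
                all(c in productive for c in t) for t in traces
            ):
                productive.add(a)
                changed = True
    prod = {
        a: {t for t in traces if all(c in productive for c in t)}
        for a, traces in prod.items()
        if a in productive
    }

    # Keep only tags accessible from the axiom.
    reachable = set()
    todo = [axiom] if axiom in prod else []
    while todo:
        a = todo.pop()
        if a not in reachable:
            reachable.add(a)
            todo.extend(c for t in prod[a] for c in t)
    return {a: prod[a] for a in reachable}
```

An empty result means that the intersection is the empty language. This mirrors the stepwise refinement mentioned above: each additional constraint is one more grammar to intersect with.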

4 Two Characterizations of XML-languages



In this section, we give two characterizations of XML-languages. The first
(Theorem 4.2) is based on surfaces. It states that, for a given set of regular
surfaces, there is only one XML-language with these surfaces, and that it
is the maximal language in this family. The second characterization
(Theorem 4.4) is syntactical and based on the notion of context.

Let $S = \{S_a \mid a \in A\}$ be a family of regular languages, and fix a
letter $a_0$ in $A$. Define $\mathcal{L}(S)$ to be the family of languages
$L \subseteq D_{a_0}$ such that $S_a(L) = S_a$ for all $a$ in $A$. Clearly,
any union of sets in $\mathcal{L}(S)$ is still in $\mathcal{L}(S)$, so there
is a maximal language (for set inclusion) in this family. The standard
language associated to $S$ is the language generated by $X_{a_0}$ in the
standard grammar of $S$.

Lemma 4.1 Let $L$ be the standard language of $S$. For any language $M$ in
$\mathcal{L}(S)$, one has $F_{a_0}(M) \subseteq L$.

Proof. Let $G$ be the standard grammar of $S$. Then $L = L_G(X_{a_0})$. We
show that $F_a(M) \subseteq L_G(X_a)$ for $a \in A$ by induction on the length
of words. Let $w = a u \bar a \in F_a(M)$. If $u$ is the empty word, then the
empty word is in $S_a$, and the word $a \bar a$ is in $L_G(X_a)$. Otherwise,
$u$ has a (unique) factorization

    $u = u_{a_1} \cdots u_{a_n}$

with $u_{a_i} \in F_{a_i}(M)$ for $i = 1, \ldots, n$. By induction,
$u_{a_i} \in L_G(X_{a_i})$ for $i = 1, \ldots, n$. Since
$a_1 \cdots a_n \in S_a$, there is a production
$X_a \to a X_{a_1} \cdots X_{a_n} \bar a$ in the grammar. Thus $w$ is in
$L_G(X_a)$. The result follows. ■

Theorem 4.2 The standard language associated to $S$ is the maximal element
of the family $\mathcal{L}(S)$. This language is XML, and it is the only
XML-language in the family $\mathcal{L}(S)$.

Proof. The first part is just Lemma 4.1 and the second part is Proposi- 
tion 3.8. ■ 

Example 4.3 The standard language associated to the sets $S_a = \{b\}$ and
$S_b = \{b, \varepsilon\}$ of Example 3.1 is the language
$\{ a b^n \bar b^n \bar a \mid n \ge 1 \}$ of Example 3.7.
Thus, the language of Example 3.1 is not XML.

We now give a more syntactic characterization of XML-languages. For this,
we define the set of contexts in $L$ of a word $w$ as the set $C_L(w)$ of
pairs of words $(x, y)$ such that $xwy \in L$.






Theorem 4.4 A language $L$ over $A \cup \bar A$ is an XML-language if and
only if

(i) $L \subseteq D_a$ for some $a \in A$,

(ii) for all $a \in A$ and $w, w' \in F_a(L)$, one has $C_L(w) = C_L(w')$,

(iii) the set $S_a(L)$ is regular for all $a \in A$.

Before giving the proof, let us compute one example.

Example 4.5 Consider the language $L$ generated by the grammar

    $S \to a T T \bar a$
    $T \to a T T \bar a \mid b \bar b$

with axiom $S$. This grammar is not XML. Clearly, $L \subseteq D_a$. Also,
$F_a(L) = L$. There is a unique set $C_L(w)$ for all $w \in L$, because at any
place in a word in $L$, a factor $w$ in $L$ can be replaced by another factor
$w'$ in $L$. Finally, $S_a(L) = (a \cup b)^2$ and $S_b(L) = \{\varepsilon\}$.
The theorem claims that there is an XML-grammar generating $L$.

Proof. We write $F_a$, $S_a$ and $C(w)$, with the language $L$ understood. We
first show that the conditions are sufficient.

Let $G$ be the XML-grammar defined by the family $S_a$ and with axiom
$X_{a_0}$, where $a_0$ is the letter such that $L \subseteq D_{a_0}$ given by
condition (i). We prove first $L_G(X_a) = F_a$ for $a \in A$. By Lemma 4.1,
$F_a \subseteq L_G(X_a)$. Next, we prove the inclusion
$F_a \supseteq L_G(X_a)$ by induction on the derivation length $k$. Assume
$X_a \xrightarrow{k} w$. Then $w = a u \bar a$ for some word $u$. If $k = 1$,
then the empty word is in $S_a$, which means that $a \bar a$ is in $F_a$. If
$k > 1$, then the derivation factorizes in

    $X_a \to a X_{a_1} \cdots X_{a_n} \bar a \xrightarrow{k-1} a u \bar a$

for some production $X_a \to a X_{a_1} \cdots X_{a_n} \bar a$. Thus there is a
factorization $u = u_1 \cdots u_n$ such that $u_i \in L_G(X_{a_i})$ for
$i = 1, \ldots, n$. By induction, $u_i \in F_{a_i}$ for $i = 1, \ldots, n$.
Moreover, the word $a_1 \cdots a_n$ is in the surface $S_a$. This means that
there exist words $u'_i$ in $F_{a_i}$ such that the word
$w' = a u'_1 \cdots u'_n \bar a$ is in $F_a$. Let $g, d$ be two words such
that $g w' d$ is in the language $L$. Then the pair
$(g a, u'_2 \cdots u'_n \bar a d)$ is a context for the word $u'_1$. By (ii),
it is also a context for $u_1$. Thus $a u_1 u'_2 \cdots u'_n \bar a$ is in
$F_a$. Proceeding in this way, one strips off all primes in the $u'_i$, and
eventually $a u_1 u_2 \cdots u_n \bar a$ is in $F_a$. Thus $w$ is in $F_a$.
This proves the inclusion and therefore the equality. Finally, by
Corollary 3.4, one has $L_G(X_{a_0}) = L$, and consequently the conditions are
sufficient.

We now show that the conditions are necessary. Let $G$ be an XML-grammar
generating $L$, with productions $X_a \to a R_a \bar a$ and axiom $X_{a_0}$.
Clearly, $L$ is a subset of $D_{a_0}$. Next, consider words $w, w' \in F_a$
for some letter $a$, and let $(g, d)$ be a context for $w$. Thus
$gwd \in L$. By Lemma 3.3, we know that $F_a = L_G(X_a)$. Thus, there exist
derivations $X_a \xrightarrow{*} w$ and $X_a \xrightarrow{*} w'$.
Substituting the second to the first in

    $X_{a_0} \xrightarrow{*} g X_a d \xrightarrow{*} g w d$        (3)

shows that $(g, d)$ is also a context for $w'$. This proves condition (ii).

Finally, since $R_a$ is a regular set, the set $S_a$ is also regular. ■

Example 4.6 Consider the language $L$ of Example 4.5. The construction
of the proof of the theorem gives the XML-grammar

    $X_a \to a (X_a \mid X_b)(X_a \mid X_b) \bar a$
    $X_b \to b \bar b$

Example 4.7 The language

    $\{ a (b \bar b)^n (c \bar c)^n \bar a \mid n \ge 1 \}$

already given above is not XML since the surface of $a$ is the nonregular set
$S_a = \{ b^n c^n \mid n \ge 1 \}$. This is the formalization of the example
given in the introduction, if the tag $b$ means bold paragraphs, and the tag
$c$ means italic paragraphs.

Example 4.8 In order to formalize the example of well-formed mathematical
papers given in the introduction, consider the language $L = a H \bar a$,
where $H$ is the language obtained from the Dyck language over a single
letter $b$ by replacing every $b$ by $t \bar t$ and every $\bar b$ by
$p \bar p$. Here, the letters $t$ and $\bar t$ stand for <theorem> and
</theorem>, and $p$ and $\bar p$ for <proof> and </proof> respectively. If
one renames $t$ as $c$ and $p$ as $\bar c$, then the surface of $a$ in the
language $L$ is the Dyck language over $c$, and it is not regular.

Example 4.9 Consider again the language

    $L = \{ a b^{2n} \bar b^{2n} \bar a \mid n \ge 1 \}$

of Example 3.1. First
$C_L(b \bar b) = \{ (a b^{2n-1}, \bar b^{2n-1} \bar a) \mid n \ge 1 \}$. Next
$C_L(b^2 \bar b^2) = \{ (a b^{2n}, \bar b^{2n} \bar a) \mid n \ge 0 \}$. Thus
there are factors with distinct contexts. This
shows again that the language is not XML.
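
The failure of condition (ii) in this example can be checked mechanically from a membership test for the language. In the sketch below, the encoding (lowercase for opening tags, uppercase for closing tags) and the names `in_example_3_1` and `is_context` are assumptions of the illustration.

```python
import re

def in_example_3_1(word: str) -> bool:
    """Membership in the language of Example 3.1: an opening a, an even
    positive number of b, the same number of closing b, a closing a."""
    m = re.fullmatch(r"a(b+)(B+)A", word)
    return (bool(m)
            and len(m.group(1)) == len(m.group(2))
            and len(m.group(1)) % 2 == 0)

def is_context(x: str, w: str, y: str, member) -> bool:
    """The pair (x, y) is a context of w when x + w + y is in the language."""
    return member(x + w + y)
```

The pair `("ab", "BA")` is a context of `"bB"` but not of `"bbBB"`, although both words are well-formed factors starting with `b`.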

Finally, we give an example showing that XML-languages are closed neither
under union nor under difference.

Example 4.10 Consider the sets $c L \bar c$ and $c M \bar c$, where
$L = D^*_{\{a,b\}}$ is the set of products of Dyck primes over $\{a, b\}$, and
$M = D^*_{\{a,d\}}$ is the set of products of Dyck primes over $\{a, d\}$.
Each of these two languages is XML. However, the union
$H = c L \bar c \cup c M \bar c$ is not. Indeed, the words
$c a b \bar b \bar a \bar c$ and $c a \bar a d \bar d \bar c$ are both in $H$.
The pair $(c, d \bar d \bar c)$ is in the context of $a \bar a$, so it has to
be in the context of $a b \bar b \bar a$, but the word
$c a b \bar b \bar a d \bar d \bar c$ is not in $H$. Given a language
$L \subseteq D_a$, write $\bar L = D_a \setminus L$ for the relative
complementation. Closure under difference would imply closure under relative
complementation, and this would imply closure under union because
$L \cup M = \overline{\bar L \cap \bar M}$. Thus XML-languages are not closed
under difference.

5 Decision problems 

As usual, we assume that languages are given in an effective way, in general 
by a grammar or an XML-grammar, according to the assumption of the 
statement. 

Some properties of XML-languages, such as inclusion or equality
(Proposition 3.11), are easily decidable because they reduce to decidable
properties of regular sets. The problem is different if one asks whether a
context-free grammar generates an XML-language. We have already seen in
Example 4.5 that there exist context-free grammars that generate
XML-languages without being XML-grammars. We shall prove later
(Proposition 5.3) that it is undecidable whether a context-free grammar
generates an XML-language. In contrast, and in relation with Theorem 4.4, it
is interesting to note that it is decidable whether a context-free language
is a subset of the set of Dyck primes. The following proposition and its
proof are an extension of a result by Knuth [5], who proved it for a single
letter alphabet $A$.

Proposition 5.1 Given a context-free language $L$ over the alphabet
$A \cup \bar A$, it is decidable whether $L \subseteq D_A^*$.

We first introduce some notation. The Dyck reduction is the semi-Thue
reduction defined by the rules $a \bar a \to \varepsilon$ for $a \in A$. A
word is reduced or irreducible if it cannot be further reduced, that is, if
it has no factor of the form $a \bar a$. Every word $w$ reduces to a unique
irreducible word denoted $\rho(w)$. We also write $w \equiv w'$ when
$\rho(w) = \rho(w')$. If $w$ is a factor of some Dyck prime, then $\rho(w)$
has no factor of the form $a \bar b$, for $a, b \in A$. Thus
$\rho(w) \in \bar A^* A^*$. In fact, $\rho(F(D_A)) = \bar A^* A^*$.
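
The Dyck reduction can be computed in one left-to-right pass with a stack. The sketch below assumes the encoding used for illustration only (lowercase letter for an opening tag, the same letter in uppercase for its closing tag); the function name `rho` mirrors the notation of the text.

```python
def rho(word: str) -> str:
    """Dyck reduction: repeatedly delete factors made of an opening tag
    immediately followed by its closing tag, and return the unique
    irreducible word obtained."""
    stack = []
    for t in word:
        if stack and stack[-1].islower() and t == stack[-1].upper():
            stack.pop()       # cancel a matching open/close pair
        else:
            stack.append(t)   # nothing to cancel: keep the letter
    return "".join(stack)
```

A word reduces to the empty word exactly when it is a product of Dyck primes; a factor of a Dyck prime reduces to closing tags followed by opening tags, as in `rho("BbaA") == "Bb"`.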

Proof of Proposition 5.1. Let $G = (V, P, S)$ be a (reduced) context-free
grammar (in the usual sense, that is with a finite number of productions)
over $T = A \cup \bar A$, with axiom $S \in V$, generating the language $L$.
For each variable $X$, we set

    $\mathrm{Irr}(X) = \{ \rho(w) \mid X \xrightarrow{*} w,\ w \in T^* \}$

This is the set of reduced words of all words generated by $X$. Testing whether
$L$ is a subset of $D_A^*$ is equivalent to testing whether
$\mathrm{Irr}(S) = \{\varepsilon\}$.

First, we observe that if $\mathrm{Irr}(S) = \{\varepsilon\}$, then
$\mathrm{Irr}(X)$ is finite for each variable $X$. Indeed, consider any
derivation $S \xrightarrow{*} g X d$ with $g, d \in T^*$. Any
$u \in \mathrm{Irr}(X)$ is of the form $u = \bar x y$, for $x, y \in A^*$.
Since $\rho(g u d) = \rho(\rho(g) u \rho(d)) = \varepsilon$, the word $x$ is a
suffix of $\rho(g)$, and $\bar y$ is a prefix of $\rho(d)$. Thus
$|u| \le |\rho(g)| + |\rho(d)|$, showing that the length of the words in
$\mathrm{Irr}(X)$ is bounded. This proves the claim.

A preliminary step in the decision procedure is to compute a candidate for
the upper bound on the length of words in $\mathrm{Irr}(X)$. To do this, one
considers any derivation $S \xrightarrow{*} g X d \xrightarrow{*} g u d$ with
$g u d \in T^*$, and one computes $\ell_X = |\rho(g)| + |\rho(d)|$. As just
mentioned, it is necessary that every reduced word in $\mathrm{Irr}(X)$ has
length at most $\ell_X$.

We now inductively construct sets $\mathrm{Irr}_k(X)$ as follows. We start
with the sets $\mathrm{Irr}_0(X) = \emptyset$, for $X \in V$, and we obtain
the sets in the next step by substituting irreducible sets of the current
step in the variables of the right-hand sides of productions. Formally,

    $\mathrm{Irr}_{k+1}(X) = \mathrm{Irr}_k(X) \cup \bigcup_{X \to \alpha} \rho(\sigma_k(\alpha))$

where $\sigma_k$ is the substitution that replaces each variable $Y$ by the
set $\mathrm{Irr}_k(Y)$. This construction is borrowed from [2], with an
additional use of the reduction map $\rho$ at each step. It follows that
$\mathrm{Irr}(X) = \bigcup_{k \ge 0} \mathrm{Irr}_k(X)$.

For each $k$, one computes $\mathrm{Irr}_k(X)$ for all $X \in V$, and then one
checks whether $\mathrm{Irr}_k(X) = \mathrm{Irr}_{k-1}(X)$ for all $X$. If so,
the computation stops. The language $L$ is a subset of $D_A^*$ if and only if
$\mathrm{Irr}_k(S) = \{\varepsilon\}$. If
$\mathrm{Irr}_k(X') \ne \mathrm{Irr}_{k-1}(X')$ for some $X'$, then one checks
whether all words in $\mathrm{Irr}_k(X)$ have length at most $\ell_X$, for all
$X$. If so, then one increases $k$. If the answer is negative, then $L$ is not
a subset of $D_A^*$.

Since the sets $\mathrm{Irr}_k(X)$ are finite, and the length of their
elements must be bounded by $\ell_X$ in order to continue, one eventually
reaches a step where the computation stops. ■
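
The iteration of this proof can be sketched as follows. This is a simplified illustration, not the exact procedure: instead of computing one bound per variable from a derivation, it takes a single parameter `bound` that is assumed to be at least as large as every such bound, so a `False` answer is only meaningful under that assumption. The encoding (lowercase opening tags, uppercase closing tags), the grammar format and the names `rho` and `subset_of_dyck_star` are likewise assumptions of the example.

```python
from itertools import product

def rho(word):
    """Dyck reduction on a string (uppercase letter closes lowercase)."""
    stack = []
    for t in word:
        if stack and stack[-1].islower() and t == stack[-1].upper():
            stack.pop()
        else:
            stack.append(t)
    return "".join(stack)

def subset_of_dyck_star(productions, axiom, bound):
    """Iterate the sets of irreducible words until a fixed point is
    reached or some word exceeds `bound`.

    productions: dict mapping a variable to a list of right-hand sides,
    each a tuple of symbols; symbols that are keys of the dict are
    variables, all others are terminal letters."""
    irr = {x: set() for x in productions}
    while True:
        new = {}
        for x, rhss in productions.items():
            acc = set(irr[x])
            for rhs in rhss:
                # Substitute the current set for each variable.
                choices = [irr[s] if s in productions else {s} for s in rhs]
                for combo in product(*choices):
                    acc.add(rho("".join(combo)))
            new[x] = acc
        if new == irr:
            # Fixed point: compare the axiom's set with {empty word}.
            return irr[axiom] == {""}
        if any(len(w) > bound for ws in new.values() for w in ws):
            return False
        irr = new
```

For the grammar with the two productions `S -> a S A S` and `S -> (empty)`, the iteration stabilizes at once with only the empty word, whereas for `S -> a S` together with `S -> (empty)` the irreducible words grow until they exceed the bound.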

Corollary 5.2 Given a context-free language $L$ over the alphabet
$A \cup \bar A$ and a letter $a$ in $A$, it is decidable whether
$L \subseteq D_a$.

Proof. It is decidable whether $L \subseteq a (A \cup \bar A)^* \bar a$ (for
instance by computing the set of first (last) letters of words in $L$). If
this inclusion holds, then one effectively computes the language
$L' = a^{-1} L \bar a^{-1}$ obtained by removing the initial $a$ and the
final $\bar a$ in all words of $L$. It follows by the structure of the
Dyck set that $L \subseteq D_a$ if and only if $L' \subseteq D_A^*$. ■

The proof of the following proposition uses standard arguments. 

Proposition 5.3 It is undecidable whether a context-free language is an 
XML-language. 

Proof. Consider the Post Correspondence Problem (PCP) for two sets of
words $U = \{u_1, \ldots, u_n\}$ and $V = \{v_1, \ldots, v_n\}$ over the
alphabet $C = \{a, b\}$. Consider a new alphabet $B = \{a_1, \ldots, a_n\}$
and define the sets $L_U$ and $L_V$ by

    $L_U = \{ a_{i_1} \cdots a_{i_k} h \mid h \ne u_{i_k} \cdots u_{i_1} \}$
    $L_V = \{ a_{i_1} \cdots a_{i_k} h \mid h \ne v_{i_k} \cdots v_{i_1} \}$

Recall that these are context-free, and that the set $L = L_U \cup L_V$ is
regular iff $L = B^* C^*$. This holds iff the PCP has no solution.

Set $A = \{a_1, \ldots, a_n, a, b, c\}$, and define a mapping
$w \mapsto \hat w$ from $A^*$ to $(A \cup \bar A)^*$ by mapping each letter
$d$ to $d \bar d$.

Consider the words $\hat u_1, \ldots, \hat u_n, \hat v_1, \ldots, \hat v_n$
in $\{a \bar a, b \bar b\}^+$ and consider the languages

    $\hat L_U = \{ a_{i_1} \bar a_{i_1} \cdots a_{i_k} \bar a_{i_k} \hat h \mid h \ne u_{i_k} \cdots u_{i_1} \}$

and

    $\hat L_V = \{ a_{i_1} \bar a_{i_1} \cdots a_{i_k} \bar a_{i_k} \hat h \mid h \ne v_{i_k} \cdots v_{i_1} \}$

Set $\hat L = c (\hat L_U \cup \hat L_V) \bar c$. The surface of $c$ in
$\hat L$ is $S_c(\hat L) = L_U \cup L_V$. If $\hat L$ is an XML-language, then
$L_U \cup L_V$ is regular, which in turn implies that the PCP has no solution.
Conversely, if the PCP has no solution, $L_U \cup L_V$ is regular, which
implies that $L_U \cup L_V = B^* C^*$, which implies that
$\hat L = c\, \widehat{B^* C^*}\, \bar c$, showing that $\hat L$ is an
XML-language. ■

Corollary 5.4 Given a context-free subset of the Dyck set, it is undecidable 
whether its surfaces are regular. 

Proof. With the notation of the proof of Proposition 5.3, the surface
$S_c(\hat L)$ of the language $\hat L$ is the language $L$, and $L$ is
regular iff the associated PCP has no solution. ■

Despite the fact that regularity of surfaces is undecidable, it appears that 
finiteness of surfaces is decidable. This is the main result of the next section. 

6 Finite Surfaces



There are several reasons to consider finite surfaces. First, the associated 
XML-grammar is then a context-free grammar in the strict sense, that is 
with a finite number of productions for each nonterminal.

Second, the question arises quite naturally within the decidability area. 
Indeed, we have seen that it is undecidable whether a context-free language 
is an XML-language. This is due basically to the fact that regularity of 
surfaces is undecidable. On the other hand, it is decidable whether a
context-free language is contained in a Dyck language, and we will prove that it is
also decidable whether the surfaces are finite. So, the basic undecidability 
result is the regularity of surfaces. 

Finally, XML-grammars with finite surfaces are very close to families of 
grammars that were studied a long time ago. They will be considered in the 
concluding section. 

Theorem 6.1 Given a context-free language L that is a subset of a set of 
Dyck primes, it is decidable whether L has all its surfaces finite. 

Corollary 6.2 Given a context-free language L that is a subset of a set of
Dyck primes, it is decidable whether L is an XML-language with finite surfaces.

In the rest of this section, we consider a reduced context-free grammar
G with nonterminal alphabet V, and terminal alphabet $T = A \cup \bar{A}$. The
language L generated by G is supposed to be a subset of some set $D_a$ of
Dyck primes. Recall that $D = \bigcup_{a \in A} D_a$. If N is an integer such that
$F(L)$ is contained in $D^{(N)} = \varepsilon \cup D \cup D^2 \cup \cdots \cup D^N$, we say
that L has bounded width.

First, observe that L has finite surfaces iff it has bounded width. Indeed,
if the surface $S_a(L)$ is infinite for some $a \in A$, then there are words of the
form $a u_1 \cdots u_n \bar{a}$ in $F(L)$ for infinitely many integers n, and clearly
$F(L)$ is not contained in any $D^{(N)}$. Conversely, if $u_1 \cdots u_n \in F(L)$, then
there are words $w, w' \in D^*$ such that $a w u_1 \cdots u_n w' \bar{a} \in F(L)$. But then
the trace of this word has length at least n. Thus if $F(L)$ is not contained
in any $D^{(N)}$, at least one surface is infinite.
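
As an illustration of the notions of trace and width used above, here is a minimal Python sketch. The encoding is an assumption made for this example only: a word is a list of strings, a letter a of A is written "a" and its barred copy is written "/a". The function names split_primes, trace and max_width are hypothetical.

```python
def split_primes(word):
    """Split a word of D* into its Dyck primes; raise ValueError otherwise."""
    primes, stack, start = [], [], 0
    for i, tag in enumerate(word):
        if tag.startswith("/"):
            # A barred letter must close the most recently opened letter.
            if not stack or stack.pop() != tag[1:]:
                raise ValueError("not a product of Dyck primes")
            if not stack:
                primes.append(word[start:i + 1])
                start = i + 1
        else:
            stack.append(tag)
    if stack:
        raise ValueError("unclosed letter")
    return primes


def trace(prime):
    """Trace of a Dyck prime a u_1 ... u_n /a: the first letters of u_1, ..., u_n."""
    if len(split_primes(prime)) != 1:
        raise ValueError("not a single Dyck prime")
    return [u[0] for u in split_primes(prime[1:-1])]


def max_width(prime):
    """Largest trace length found among the Dyck primes nested in `prime`."""
    inner = split_primes(prime[1:-1])
    return max([len(inner)] + [max_width(u) for u in inner])


w = ["a", "b", "/b", "c", "b", "/b", "b", "/b", "/c", "/a"]
print(trace(w), max_width(w))
```

In this encoding, a language has finite surfaces exactly when max_width stays bounded over all its words, which is the equivalence stated in the paragraph above.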

For the proof of the theorem, we investigate iterating pairs in G. We 
start with a lemma of independent interest. 

Lemma 6.3 If $X \xrightarrow{*} gXd$ for some words $g, d \in (A \cup \bar{A})^*$, then there
exist words $x, y, p, q \in A^*$ such that

$$\rho(g) = \bar{x} p x, \qquad \rho(d) = \bar{y} \bar{q} y$$

and moreover p and q are conjugate words.
Proof. The words g and d are factors of D. Thus, there exist words
$x, y, z, t \in A^*$ such that $g \equiv \bar{x} z$, $d \equiv \bar{t} y$. There is a word v such that
$g^n v d^n$ is a factor of D for each $n > 0$. From
$g^2 v d^2 \equiv \bar{x} z \bar{x} z \, v \, \bar{t} y \bar{t} y$, one gets that x is a suffix of z or
z is a suffix of x, and similarly for t and y. If z is a suffix of x, set $x = pz$. But
then $\bar{z} \bar{p}^n$ is a prefix of $\rho(g^n v d^n)$ for all n, contradicting the fact that
Irr(X) is finite. Thus x is a suffix of z and similarly y is a suffix of t. Set
$z = px$ and $t = qy$. Then $\rho(g) = \bar{x} p x$ and $\rho(d) = \bar{y} \bar{q} y$. Since
$g^n v d^n \equiv \bar{x} p^n x \, v \, \bar{y} \bar{q}^n y$ and Irr(X) is finite, one has $|p| = |q|$
and moreover p is a factor of $q^2$. ■

A pair (g, d) such that $X \xrightarrow{*} gXd$ is a lifting pair if the word p in
Lemma 6.3 is nonempty; it is a flat pair if $p = \varepsilon$.
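
The decomposition of Lemma 6.3 and the distinction between lifting and flat pairs can be sketched in Python as follows. This is only an illustration under stated assumptions: a word is a list of strings, a letter a is written "a" and its barred copy "/a", the bar of a word reverses it (so the bar of ab is "/b", "/a"), and the names reduce_word, decompose and is_flat are hypothetical. Only the left word g of a pair is examined.

```python
def reduce_word(word):
    """Dyck reduction rho: cancel factors a /a until none is left."""
    out = []
    for tag in word:
        if tag.startswith("/") and out and out[-1] == tag[1:]:
            out.pop()           # the pair a /a is erased
        else:
            out.append(tag)
    return out


def decompose(g):
    """Return (x, p) with rho(g) = bar(x) p x, or None if there is none."""
    r = reduce_word(g)
    k = 0
    while k < len(r) and r[k].startswith("/"):
        k += 1
    if any(tag.startswith("/") for tag in r[k:]):
        return None             # rho(g) is not (barred letters)(unbarred letters)
    x = [tag[1:] for tag in reversed(r[:k])]
    z = r[k:]
    if len(x) > len(z) or z[len(z) - len(x):] != x:
        return None             # x is not a suffix of z
    return x, z[:len(z) - len(x)]


def is_flat(g):
    """True when the word p of the decomposition exists and is empty."""
    parts = decompose(g)
    return parts is not None and parts[1] == []


print(decompose(["a", "c", "/c", "b"]))     # p is nonempty: lifting
print(is_flat(["/b", "c", "/c", "b"]))      # rho(g) = /b b: flat
```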

Lemma 6.4 If $X \xrightarrow{*} g_1 X d_1$ or $X \xrightarrow{*} g_2 X d_2$ is a lifting pair, then the
compound pair $X \xrightarrow{*} g_1 g_2 X d_2 d_1$ is a lifting pair.

Proof. According to Lemma 6.3, $g_1 \equiv \bar{x}_1 p_1 x_1$ and $g_2 \equiv \bar{x}_2 p_2 x_2$.
Assume the compound pair is flat. Then
$\bar{x}_1 p_1 x_1 \bar{x}_2 p_2 x_2 \equiv \bar{z} z$ for some word $z \in A^*$. Thus
the number of barred letters is the same as the number of unbarred letters
on both sides. This implies that $p_1$ and $p_2$ are the empty word. ■

Lemma 6.5 The language L has bounded width iff G has no flat pair. 

Proof. If there is a flat pair (g, d) in G, then L has an infinite surface.
Indeed, $u g^n v d^n w \in L$ for all n and for some words u, v, w, and since
$g \equiv \bar{x} x$, there is a conjugate of g in $D^*$. Thus $g^n$ has a factor that is a
product of at least $n - 1$ Dyck primes, and L has unbounded width.

Conversely, assume that L has unbounded width. Let K be the maximum
of the lengths of the right-hand sides of the productions in G. Let m be an
integer that is strictly greater than the maximum of the lengths of the words in
the (finite) sets Irr(X) for $X \in V$. Consider a word $z u_1 u_2 \cdots u_N z' \in L$ with
$u_1, \ldots, u_N \in D$, for some large integer N to be fixed later. In a derivation
tree for this word, let $X_0$ be the deepest node such that the tree rooted
at $X_0$ generates a word containing the factor $u_1 u_2 \cdots u_N$. The production
applied at that node has the form $X_0 \to Y_1 \cdots Y_k$ with
$Y_1, \ldots, Y_k \in V \cup T$ and $k \le K$. By the pigeon-hole principle, at least one of
$Y_1, \ldots, Y_k$ generates a word containing a factor that is a product of at least
$N/k - 1 \ge N/K - 1$ consecutive $u_i$'s. Denote this nonterminal $X_1$. If N is
large enough, one constructs a sequence $X_0, X_1, \ldots, X_h$ of nonterminals, and
if $h > m \cdot \mathrm{Card}\, V$, there are at least m of these variables that are the same.
A straightforward computation shows that a bound of the form
$N > K + K^2 + \cdots + K^e$, with an exponent e depending only on m and
$\mathrm{Card}\, V$, is convenient. We get pairs

$$Y \xrightarrow{*} s_1 w_1 p_1 Y d_1$$
$$Y \xrightarrow{*} s_2 w_2 p_2 Y d_2$$
$$\vdots$$

where each of $w_1, \ldots, w_m$ is in $D^*$, the $s_i$ and $p_i$ are suffixes (resp. prefixes)
of words in D, and $p_1 s_2, p_2 s_3, \ldots, p_{m-1} s_m$ are Dyck primes. For each i, define
$x_i \in A^*$ by setting $\bar{x}_i = \rho(s_i)$. From $\rho(p_i s_{i+1}) = \varepsilon$, it follows that
$\rho(p_i) = x_{i+1}$. Thus $s_i w_i p_i \equiv \bar{x}_i x_{i+1}$. In view of Lemma 6.3, there are
words $y_i \in A^*$ such that $x_{i+1} = y_{i+1} x_i$ for $i = 1, \ldots, m - 1$, and each
$s_i w_i p_i$ is equivalent to $\bar{x}_i y_{i+1} x_i$, which in turn is equivalent to
$\bar{x}_1 \bar{y}_2 \cdots \bar{y}_i \, y_{i+1} \, y_i \cdots y_2 x_1$. All $\bar{x}_1 \bar{y}_2 \cdots \bar{y}_i$
are prefixes of words in Irr(Y), and since this set is finite, one of the $y_i$ is the
empty word because of the choice of m. This shows that one of the pairs is
flat. ■

We now need to prove that it is decidable whether there exists a flat pair.

Lemma 6.6 Assume that $X \xrightarrow{*} \ell_1 Y r_1$, $Y \xrightarrow{*} gYd$ and $Y \xrightarrow{*} \ell_2 X r_2$. If the
pair $X \xrightarrow{*} \ell_1 g \ell_2 X r_2 d r_1$ is flat, then the pair $Y \xrightarrow{*} gYd$ is flat.

Proof. According to Lemma 6.3, $\ell_1 g \ell_2 \equiv \bar{z} z$ and $g \equiv \bar{x} p x$ for some
$z, x, p \in A^*$. Thus, $\ell_1 g \ell_2$ has the same number of barred and of unbarred
letters, and g has at least as many unbarred letters as barred letters. Next,
$X \xrightarrow{*} \ell_1 \ell_2 X r_2 r_1$ is an iterating pair, and therefore $\ell_1 \ell_2$ has at least as
many unbarred letters as barred letters. Thus g has as many unbarred letters
as it has barred letters. It follows that p is the empty word. ■

Proof of Theorem 6.1. In view of Lemma 6.5, it suffices to check whether
the grammar has a flat pair. For this, consider the derivation tree associated
to a pair $X \xrightarrow{*} gXd$. We call this tree (and the pair) elementary if there
is no variable that is repeated on the path from the root X to the leaf X.
Lemmas 6.4 and 6.6 show that if there is a flat pair, then there is also an
elementary flat pair.

To each elementary pair, we associate a skeleton defined as follows. Consider
the path $X = X_0, X_1, \ldots, X_n = X$ from the root X to the leaf X. Each
of the $X_{i+1}$ is in the right-hand side of some production $X_i \to \omega_i$. The
skeleton is the derivation obtained by composing these productions. It results in
a derivation $X \xrightarrow{*} U X U'$, for some $U, U' \in (V \cup T)^*$. There are only a finite
number of skeletons because each skeleton is built from an elementary pair.

For each skeleton $X \xrightarrow{*} U X U'$, we consider the set of pairs $X \xrightarrow{*} u X u'$
for all $u \in \mathrm{Irr}(U)$, $u' \in \mathrm{Irr}(U')$ (Irr(U) denotes the set of reduced words of
words deriving from U). Since every Irr(U) is finite, the set of pairs obtained is
finite. It suffices to check whether there is a flat pair among them. ■
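
The enumeration of skeletons in the proof above can be sketched in Python. This is a partial illustration only: it lists the skeletons $X \xrightarrow{*} U X U'$ of elementary pairs and leaves out the computation of the sets Irr(U) and the flatness test. The grammar representation (a dict mapping each nonterminal to a list of right-hand sides, each a list of symbols, with barred letters written "/a") and the name skeletons are assumptions of this sketch.

```python
def skeletons(grammar, start):
    """Yield (U, U2) for every elementary skeleton start ->* U start U2."""
    def walk(node, left, right, seen):
        for rhs in grammar[node]:
            for i, sym in enumerate(rhs):
                # Composing node -> rhs with the derivation built so far.
                new_left = left + rhs[:i]
                new_right = rhs[i + 1:] + right
                if sym == start:
                    yield new_left, new_right
                elif sym in grammar and sym not in seen:
                    # Descend only through variables not yet on the path.
                    yield from walk(sym, new_left, new_right, seen | {sym})
    yield from walk(start, [], [], {start})


# Hypothetical grammar: S -> a T /a,  T -> b T /b | c S /c | empty word
grammar = {
    "S": [["a", "T", "/a"]],
    "T": [["b", "T", "/b"], ["c", "S", "/c"], []],
}
for u, u2 in skeletons(grammar, "T"):
    print(u, "T", u2)
```

Since no variable is repeated along a path, the recursion depth is bounded by the number of nonterminals, which reflects the finiteness argument used in the proof.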

As a final remark, we consider grammars and languages similar to parenthesis
grammars and languages studied by McNaughton [7] and by Knuth [5].
We will say more about them in Section 8. A polyparenthesis grammar is a
grammar with a terminal alphabet $T = A \cup \bar{A}$, and where every production
is of the form $X \to a m \bar{a}$, with $m \in V^*$, $a \in A$, $\bar{a} \in \bar{A}$. A polyparenthesis
language is a language that has a polyparenthesis grammar. Thus, polyparenthesis
grammars differ from XML-grammars in two aspects: there are
only finitely many productions, and the nonterminal need not be unique
for each pair $(a, \bar{a})$ of letters.
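
The syntactic condition defining a polyparenthesis grammar is easy to test mechanically. The following Python sketch assumes a grammar given as a dict from nonterminals to lists of right-hand sides (lists of symbols), with barred letters written "/a" and nonterminal names distinct from terminal letters; the name is_polyparenthesis is hypothetical.

```python
def is_polyparenthesis(grammar):
    """Check that every production has the form X -> a m /a with m in V*."""
    for rhs_list in grammar.values():
        for rhs in rhs_list:
            if len(rhs) < 2:
                return False
            first, last, middle = rhs[0], rhs[-1], rhs[1:-1]
            # The first symbol must be an unbarred terminal letter,
            # and the last symbol must be its barred copy.
            if first in grammar or first.startswith("/") or last != "/" + first:
                return False
            # Everything in between must be a nonterminal.
            if any(sym not in grammar for sym in middle):
                return False
    return True


print(is_polyparenthesis({"X": [["a", "Y", "Y", "/a"], ["a", "/a"]],
                          "Y": [["b", "/b"]]}))
```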

Proof of Corollary 6.2. Let G be a context-free grammar over $A \cup \bar{A}$
generating $L = L(G)$. It is decidable whether $L \subset D_a$ for some letter $a \in A$
(Corollary 5.2). If this holds, we check whether L has finite surfaces. This
is decidable (Theorem 6.1). If this holds, we proceed further. A generalization
of an argument of Knuth [5] shows that it is decidable whether L is a
polyparenthesis language, and it is possible to effectively compute a
polyparenthesis grammar G' for it. On the other hand, let G'' be the standard
grammar obtained from the (finite) surfaces. The language L is XML if and
only if $L = L(G'')$, thus if and only if $L(G') = L(G'')$. This equality is
decidable. Indeed, any XML-grammar with a finite set of productions is
polyparenthetic, and equality of polyparenthesis grammars is decidable [7]. ■

7 Regular XML languages 

Most of the XML languages encountered in practice are in fact regular.
Therefore, it is interesting to investigate this case. The main result is that,
contrary to the general case, it is decidable whether a regular language is
XML. Moreover, XML-grammars generating regular languages will be shown
to have a special form: they are sequential in the sense that their nonterminals
can be ordered in such a way that the nonterminal in the left-hand side of
a production is always strictly less than the nonterminals in the right-hand
side. The main result of this section is

Theorem 7.1 Let $K \subset D_A$ be a regular language. It is decidable whether K
is an XML-language.

One gets the following structure theorem. 
Proposition 7.2 Let K be an XML-language, generated by an XML-grammar
G. Then K is regular if and only if the grammar G is sequential.

We shall give two proofs of Theorem 7.1, based on the two characterizations
of XML-languages given above (Theorem 4.2 and Theorem 4.4). Both
proofs require the effective computation of surfaces.

Lemma 7.3 Let $K \subset D_A$ be a regular language. The surfaces of K are
effectively computable regular sets.

Proof. Let $\mathcal{A}$ be a finite automaton with no useless states recognizing K. For
each pair (p, q) of states, let $K_{p,q}$ be the regular language composed of the
labels of paths starting in p and ending in q. A pair (p, q) of states is good
for the letter a in A if $K_{p,q} \cap D_a \ne \emptyset$. This property is decidable. A pair is
good if it is good for some letter. Let G be the set of good pairs, considered
as a new alphabet, and consider the set M(a) over G composed of all words

$$(p_0, p_1)(p_1, p_2) \cdots (p_{n-1}, p_n)$$

such that there is an edge ending in $p_0$ in the automaton $\mathcal{A}$ and labeled by
a and there is an edge starting in $p_n$ labeled by $\bar{a}$. Clearly, M(a) is a (local)
regular language over G.

Consider now the finite substitution f from $G^*$ into $A^*$ defined by

$$f(p, q) = \{ a \in A \mid (p, q) \text{ is } a\text{-good} \}$$

Then $f(M(a))$ is the surface of a in K, that is $f(M(a)) = S_a(K)$. This
proves the lemma. ■
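
The construction in this proof can be sketched in Python. Two choices here are assumptions of the sketch and not taken from the text: the good pairs are computed by a fixed-point saturation (one possible way to decide $K_{p,q} \cap D_a \ne \emptyset$), and the automaton is given as a list of edges (p, tag, q), assumed to have no useless states, with barred letters written "/a". The names good_triples and in_surface are hypothetical.

```python
def good_triples(edges):
    """All (p, a, q) such that some word of D_a labels a path from p to q."""
    good = set()
    changed = True
    while changed:
        changed = False
        step = {}
        for p, _, q in good:
            step.setdefault(p, set()).add(q)
        for p, a, p0 in edges:
            if a.startswith("/"):
                continue
            # States reachable from p0 by a product of Dyck primes found so far.
            reach, todo = {p0}, [p0]
            while todo:
                s = todo.pop()
                for t in step.get(s, ()):
                    if t not in reach:
                        reach.add(t)
                        todo.append(t)
            for pn, tag, q in edges:
                if tag == "/" + a and pn in reach and (p, a, q) not in good:
                    good.add((p, a, q))
                    changed = True
    return good


def in_surface(edges, a, word):
    """Decide whether `word` (a list of letters of A) is in the surface S_a(K)."""
    good = good_triples(edges)
    # p_0 ranges over the states entered by an edge labeled a.
    current = {q for _, tag, q in edges if tag == a}
    for b in word:
        current = {q for p, c, q in good if c == b and p in current}
    # p_n must be the origin of an edge labeled by the barred copy of a.
    return any(tag == "/" + a and p in current for p, tag, _ in edges)


# Hypothetical automaton for K = a (b /b)* /a
edges = [(0, "a", 1), (1, "b", 2), (2, "/b", 1), (1, "/a", 3)]
print(in_surface(edges, "a", ["b", "b"]), in_surface(edges, "a", ["a"]))
```

The function in_surface simulates the local language M(a) and the substitution f at the same time: the set current plays the role of the last state $p_i$ of the word of good pairs read so far.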

First proof of Theorem 7.1. We use Theorem 4.2. Let K be a regular subset
of $D_A$. It is decidable whether $K \subset D_{a_0}$ for some letter $a_0$. If this holds,
then by Lemma 7.3, the family S of surfaces $S_a(K)$ is effectively computable.
From this family, one constructs the standard language L associated to S.
This is effective. We know that $K \subset L$, and consequently K is an XML-language
if and only if $L \subset K$ or equivalently if and only if $L \cap K' = \emptyset$, where
$K' = (A \cup \bar{A})^* \setminus K$ is the complement of K. This is decidable. ■

Second proof of Theorem 7.1. We use Theorem 4.4. Let $\mathcal{A}$ be the minimal
finite automaton with no useless states recognizing K, with initial state i
and set of final states T. For each pair (p, q) of states, let $K_{p,q}$ be the regular
language composed of the labels of paths starting in p and ending in q. For
each letter a in A, the set $F_{a,p,q} = K_{p,q} \cap D_a$ is the set of well-formed factors
of K starting with the letter a that are labels of paths from p to q. Clearly,
$F_{a,p,q} \subset F_a(K)$, for all p, q. We show that all words in $F_a(K)$ have same
context if and only if $F_{a,p,q} = F_a(K)$, for all p, q such that $F_{a,p,q} \ne \emptyset$.

Assume first that all words in $F_a(K)$ have same context. Let p, q be such
that $F_{a,p,q} \ne \emptyset$, and consider a word $w \in F_{a,p,q}$. There exist words x and y
such that $i \cdot x = p$ and $q \cdot y \in T$. The pair (x, y) is a context for w. Let w'
be a word in $F_a(K)$. Then there is a successful path with label $x w' y$. Thus
there is a state q' such that $p \cdot w' = q'$ and $q' \cdot y \in T$. If $q \ne q'$, there is a
word z separating q and q', because $\mathcal{A}$ is minimal. Thus $q \cdot z \in T$ and
$q' \cdot z \notin T$ or vice-versa. However, this means that (x, z) is a context for w
and is not a context for w' or vice-versa. Thus $q = q'$ and $w' \in F_{a,p,q}$. This
proves that $F_a(K) \subset F_{a,p,q}$.

Conversely, assume that $F_{a,p,q} = F_a(K)$, for all p, q such that $F_{a,p,q} \ne \emptyset$.
The set of contexts of any word $w \in F_a(K)$ is the union of the sets
$K_{i,p} \times K_{q,t}$ over all pairs (p, q) with $F_{a,p,q} \ne \emptyset$ and all final states
$t \in T$. Thus all words have same contexts.

It follows from the preceding claim that K is an XML-language if and
only if $F_{a,p,q} = F_{a,p',q'}$ for all pairs for which the languages are not empty.
Although equality of context-free languages is not decidable in general, this
particular equality is decidable because $F_{a,p,q} = F_{a,p',q'}$ iff

$$D_a \cap \bigl( (K_{p,q} \setminus K_{p',q'}) \cup (K_{p',q'} \setminus K_{p,q}) \bigr) = \emptyset$$

For the proof of Proposition 7.2 we use the following notation and result.
For any word $w \in (A \cup \bar{A})^*$, the weight of w is the number
$|w|_A - |w|_{\bar{A}}$. Here, $|u|_A$ is the number of occurrences of letters in A in
the word u. The height of w is the number

$$h(w) = \max\{ |u|_A - |u|_{\bar{A}} \mid uv = w \}$$

that is the maximum of the weights of its prefixes. The height of a language
is the maximum of the heights of its words. This is finite or infinite.
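
For concreteness, weight and height can be computed in one pass over a word. The sketch below assumes the same illustrative encoding as elsewhere (a word is a list of strings, barred letters are written "/a"); the function names are hypothetical.

```python
def weight(word):
    """Number of unbarred letters minus number of barred letters."""
    return sum(-1 if tag.startswith("/") else 1 for tag in word)


def height(word):
    """Maximum of the weights of the prefixes of `word` (empty prefix included)."""
    best = level = 0
    for tag in word:
        level += -1 if tag.startswith("/") else 1
        best = max(best, level)
    return best


w = ["a", "b", "/b", "b", "c", "/c", "/b", "/a"]
print(weight(w), height(w))
```

For a Dyck prime, the height is simply its maximal nesting depth.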

Proposition 7.4 Let $K \subset D_A$ be a language over $A \cup \bar{A}$. If K is regular,
then it has finite height.

Proof. This result is folklore. We just sketch its proof. Given an automaton
recognizing K, the weight $|u|_A - |u|_{\bar{A}}$ of the label u of a circuit must be
zero for every circuit, by the pumping lemma. Thus, the height of K is the
maximum of the heights of the labels on all acyclic successful paths in the
automaton augmented by the sum of the heights of all its simple cycles. Since
the automaton is finite, this number is finite. ■
Proof of Proposition 7.2. Consider an XML-grammar G, and construct a
graph with an edge $(X_a, X_b)$ whenever $X_b$ appears in the right-hand side
of a production with $X_a$ as left-hand side. Nonterminals can be ordered
to fulfill the condition of a sequential grammar if and only if the graph
has no cycle. If the graph has no cycle, then the language generated by
a variable of index i is a regular expression of languages of higher indices.
Thus, the language generated by the grammar G is regular. Conversely,
if there is a cycle through some variable $X_a$, then there is a derivation of the
form $X_a \xrightarrow{*} a u X_a v \bar{a}$ for some words u, v. By iterating this derivation, one
constructs words of arbitrary height in K, and so, by Proposition 7.4, K is
not regular. ■
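
The criterion used in this proof amounts to a cycle test on the graph of nonterminals. A minimal Python sketch follows, assuming the grammar is summarized by a dict children that maps each letter a to the set of letters b such that $X_b$ occurs in the right-hand sides of $X_a$; this representation and the name is_sequential are assumptions of the example.

```python
def is_sequential(children):
    """True when the graph a -> b (X_b occurs in a production of X_a) is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {a: WHITE for a in children}

    def has_cycle(a):
        color[a] = GREY                      # a is on the current search path
        for b in children.get(a, ()):
            state = color.get(b, WHITE)
            if state == GREY:
                return True                  # back edge: a cycle through b
            if state == WHITE and has_cycle(b):
                return True
        color[a] = BLACK                     # fully explored, no cycle below a
        return False

    return not any(color[a] == WHITE and has_cycle(a) for a in children)


print(is_sequential({"book": {"chapter"}, "chapter": {"para"}, "para": set()}))
print(is_sequential({"list": {"item"}, "item": {"list", "text"}, "text": set()}))
```

In the second example the element list may contain itself through item, so by Proposition 7.2 the corresponding XML-language would not be regular (assuming the grammar is reduced).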

Note that the language $F_a(K)$ of well-formed factors is regular when K
is a regular XML-language, because $F_a(K)$ is the language generated by the
nonterminal $X_a$ in a sequential grammar.

8 Historical Note 

There exist several families of context-free grammars related to XML-grammars
that have been studied in the past. In the sequel, the alphabet of
nonterminals is denoted by V.

Parenthesis grammars. These grammars have been studied in particular
by McNaughton [7] and by Knuth [5]. A parenthesis grammar is a grammar
with terminal alphabet $T = B \cup \{a, \bar{a}\}$, and where every production is of
the form $X \to a m \bar{a}$, with $m \in (B \cup V)^*$. A parenthesis grammar is pure if
$B = \emptyset$. In a parenthesis grammar, every derivation step is marked, but there
is only one kind of tag.

Bracketed grammars. These were investigated by Ginsburg and Harrison
in [3]. The terminal alphabet is of the form $T = A \cup B \cup C$ and productions
are of the form $X \to a m b$, with $m \in (V \cup C)^*$. Moreover, there is a bijection
between A and the set of productions. Thus, in a bracketed grammar, every
derivation step is marked, and the opening tag identifies the production that
is applied (whereas in an XML-grammar it only gives the nonterminal).

Very simple grammars. These grammars were introduced by Korenjak
and Hopcroft [6], and studied in depth later on. Here, the productions are
of the form $X \to a m$, with $a \in A$ and $m \in V^*$. In a simple grammar, the
pair (a, m) determines the production, and in a very simple grammar, there
is only one production for each a in A.
Chomsky-Schützenberger grammars. These grammars are used in the
proof of the Chomsky-Schützenberger theorem (see e.g. [4]), even if they
were never studied on their own. Here the terminal alphabet is of the form
$T = A \cup \bar{A} \cup B$, and the productions are of the form $X \to a m \bar{a}$. Again,
there is only one production for each letter $a \in A$.

XML-grammars differ from all these grammars by the fact that the set
of productions is not necessarily finite, but regular. However, one could
consider a common generalization, by introducing balanced grammars. In such
a grammar, the terminal alphabet is $T = A \cup \bar{A} \cup B$, and productions are of
the form $X \to a m \bar{a}$, with $m \in (V \cup B)^*$. Each of the parenthesis grammars,
bracketed grammars, Chomsky-Schützenberger grammars are balanced. If
$B = \emptyset$, such a pure grammar covers XML-grammars with finite surfaces. If
the set of productions of each nonterminal is allowed to be regular, one gets
a new family of grammars with interesting properties.